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Abstract. We consider here the problem of chaining seeds in ordered 
trees. Seeds are mappings between two trees Q and T and a chain is a 
subset of non overlapping seeds that is consistent with respect to postfix 
order and ancestrality. This problem is a natural extension of a similar 
problem for sequences, and has applications in computational biology, 
^^r\ such as mining a database of RNA secondary structures. For the chaining 

problem with a set of m constant size seeds, we describe an algorithm 
with complexity 0(m 2 log(m)) in time and 0(m 2 ) in space. 



1 Introduction 



Comparing sequences is a basic task in computational biology, either for mining 
genomics database, or for filtering large sequence datasets. The exponential in- 
crease of available sequence data motivates the need for very efficient sequence 
Q*\ . comparison algorithms. A fundamental application of sequence comparison is 

^^ ' to search efficiently in a database a set of sequences close to a query sequence. 

f~^. • Indeed, the pairwise comparison between the query and every sequence of the 

f^ ' database cannot practically be applied due to the quadratic time complexity of 

edit distance computation. A typical approach to tackle this issue is to rely on 
short sequences, called seeds, present in the query. These seeds can be detected 
very quickly in the database using indexing techniques, then a maximal set of 
seeds, called a chain, that tiles both the query and a sequence of the database, 
must be identified while conserving the same order in both sequences. Widely 
used programs such as BLAST [T] and FASTA [10113] rely on such an approach. 
We refer the reader to [216] for surveys of sequence comparison in computa- 
tional biology. From an algorithmic point of view, an optimal chain between 
two sequences, given m seeds, can be computed in 0(mlog(m)) time and 0(m) 
space [9] (see [12] for a recent survey). 

With the recent development of high-throughput genome annotation meth- 
ods, similar problems appear to be relevant for the analysis of more complex 
biological structures. For instance, RNA secondary structures can be modeled 
by a tree or a graph whose nodes are the nucleotides and whose edges are the 
chemical bonds between them [14]. Large databases have been constituted for 
this kind of biological data, such as Rfam [5] . Comparing and mining large RNA 



secondary structure databases is now an important computational biology prob- 
lem. The initial approach to these problems relied on extensions of the notion 
of edit distance to ordered trees, pioneered by Zhang and Shasha |T5]. The tree 
edit approach has been extended in several ways since then, leading cither to 
hard problems, when a comprehensive set of edit operations is considered [8], or 
to algorithms with a time complexity, at best cubic, even with a minimal set of 
edit operations |4I15] . 

Recently, Heyne et al. [7 introduced a chaining problem on another repre- 
sentation of ordered trees called arc annotated sequences, that they solved using 
dynamic programming. Their seeds are exact common patterns and they applied 
their algorithm for RNA secondary structure comparison: once a maximal chain 
of seeds between two given RNA secondary structures is detected, the regions 
between successive seeds are processed independently using an edit distance al- 
gorithm, which speeds up significantly the comparison process. From what we 
know so far, [7] is the first paper addressing a chaining problem in trees. How- 
ever, when applied for chaining seeds in sequences, their algorithm complexity is 
higher than the best-known chaining algorithms for sequences, which raises the 
question of a more efficient algorithm, of both theoretical and practical interest. 

After some preliminaries (Section [5] and |3]), we describe in Section 0] an algo- 
rithm for finding the score of a maximum-score chain between two ordered trees 
(Maximal Chaining Problem) in 0(m 2 log(m)) time and 0(m 2 ) space when there 
are m seeds of constant size, thus improving on the result of Heyne et al. [7] . We 
conclude with further research avenues. 



2 Background and problem statement 

Let T be an ordered rooted tree of size n. Nodes of T are identified with their 
postfix-order index from to n — 1. Thus, n — 1 represents the root of T. Tj 
is the subtree of T rooted at i. We denote by T[i,j] the forest induced by the 
nodes that belong to the interval [i,j]] if f > j, then T[i, j] is empty. The partial 
relationship "i is an ancestor of j" is denoted by i -< j. For a tree T and a node 
i of T, the first leaf visited during a postfix traversal of Ti is denoted by l(i) and 
called the leftmost leaf of the node i. The ordered forest induced by the proper 
descendants of i is denoted by T, = T[l(i),i — 1]. 

Definition 1. Let T be an ordered rooted tree: 

1. Let G = {go, ■ • ■ , gk-i} be an ordered set of k nodes of T, with < gj < n. 
If the subgraph of T induced by G is connected, then G is called an internal 
tree rooted at g^-i also referred as re- 

2. The set of leaves of the internal tree G is denoted by L(G). 

3. A node gj of G is said to be completely inside G if gj is not a leaf of T and 
all its children belong to G. The set of nodes of G that are not completely 
inside G is called the border of G and is denoted by B(G). 

4. Two internal trees G 1 and G 2 overlap if they share at least one node, i.e. 
G 1 n G 2 ^ 0. 



We now recall the central notion of valid mapping between two trees intro- 
duced in [T3] for the tree edit distance. Given two trees Q and T, a valid mapping 
P between Q and T is a set of pairs of Q x T such that, if (qi,ti) and {qj,tj) 
belong to P, then 

1 . qi — qj if and only if t% = tj , 

2. qi < qj if and only if t% < tj, 

3. qi -< qj if and only if t% -< tj. 

In the following we use the term mapping to refer a valid mapping. Given a 
mapping P between Q and T, the smallest internal tree of Q (resp. T) that 
contains all nodes of Q (resp. T) belonging to a pair of P is denoted by Qp 
(resp. Tp). Qp and Tp are respectively called the internal trees of Q and T 
induced by P. 

Definition 2. Let Q and T be two ordered trees. 

1. A seed M. between Q and T is a mapping between Q and T such that 
{tq m , Pr^ ) G .M and all the nodes of the border of Qm (resp. T^) belong 
to a pair of .M. 

2. The border (resp. leaves) B{M) (resp. L{M)) of the seed M. is the set of 
pairs (x, y) G .M such that x G B{Qm) and j/ G B{Tm) (resp. a; G L{Qm) 
and y G L(Tm)). 

3. The size \M\ of the seed tV( is the number of pairs its mapping contains. 

4. For a set S of seeds, ||S|| is the sum of the sizes of the 151 seeds in S. 

Definition 3. Let Q and T be two ordered trees. 

1. A pair (P 1 ^ 2 ) of seeds between Q and T is chainable if Qpi does not 
overlap Qp?, T P i does not overlap T P i, and P 1 U P 2 is a mapping. 

2. A chain is a set C — {P°, P 1 ,. . . , P 1 ^ 1 } of seeds between Q and T such that 
any pair (P 1 ^^) of distinct seeds in C is chainable. 

3. Given a scoring function w for the seeds P % , the score of a chain C is the 
sum of the scores of its seeds: v(C) = ^2 t v(P l ). 

4. Given a set S of possibly overlapping seeds between Q and T, Cs(Q,T) 
denotes the set of all possible chains between Q and T included in S. 

Problem. Maximum Chaining Problem (MCP): 

Input: A pair (Q,T) of ordered rooted trees, a set S = {P°, . . . , p m_1 } of m 

possibly overlapping seeds between Q and T, and a scoring function v on the 

seeds P\ 

Output: The maximum score chain C included in S. 

MCP(Q, T, S) = max{v(C*); C G C S (Q, T)} 

Fig. [1] shows an instance of the MCP problem with 6 seeds. 

Remark 1. The notion of mapping extends naturally to ordered forests. Hence, 
if S is a set of seeds between two forests Pi and P2 such that each seed is a seed 
between a tree of Pi and a tree of P2 , then the MCP can naturally be extended 
to ordered forests. 




Fig. 1. An instance of the MCP problem with 6 seeds: P° = {(2, 10), (3, 11)}, 
P l = {(6,3)}, P 2 = {(9,5)}, P 3 = {(10, 6), (11, 7)}, F 4 = 
{(7, 4), (11, 7), (12, 8)}, P 5 = {(3,1), (13, 9), (14, 11)}. If «(P*) = |P*| for every 
seed, an optimal chain is composed of {P 1 , P 2 , P 4 , P 5 } and has score 8. 

Remark 2. To compare with chaining algorithms for sequences, we represent 
a sequence u — (uq, . . . ,u n -i) by a unary tree, rooted at a node labeled by 
u„_i, where every internal node has a single child and uq is the unique leaf: the 
sequence of nodes visited by the postfix-order traversal of this tree is exactly u. 

Motivation and background. As far as we know, [7. is the only work that at- 
tacks the MCP in tree structures, although the authors describe the problem 
in terms of arc-annotated sequences. They proposed a dynamic programming 
algorithm to solve the maximum chaining problem with some restrictions on 
the seeds (precisely, seeds are maximal exact pattern common to the consid- 
ered sequences) . This dynamic programming technique is different from, and in 
fact simpler than, the approaches used for the Maximum Chaining Problem in 
sequences, and, when applied to arc-annotated sequences with no arc (i.e. se- 
quences) it can be shown this algorithm has a worst-case time complexity in 
0(m 2 ), where m is the number of seeds (see Appendix). 

Our main result is an algorithm that solves the Maximum Chaining Problem 
with a better complexity than the algorithm of [7] . After a preprocessing of the 
m input seeds of S, that can be done in time OdlS 1 )!) (we discuss in appendix the 
complexity of this preprocessing) , our algorithm solves the Maximum Chaining 
Problem in 0(||5|| log(||5||) + Tn\\S\\ log(m)) time and ©(mils'!!) space. In par- 
ticular, when applied on sequences (Remark [5]), our algorithm has the same 
complexity than the best known algorithms for chaining seeds in sequences [12] , 
that is 0(mlog(m)) in time and O(m) in space. 

Remark 3. Without lost of generality, from now we assume that the seeds P % 
are sorted increasingly according the postfix number of their roots in Q, that is: 
tq o — ' ' ' fs r Q i — ' ' ' fs r Q m-i • F° r a gi ven chain C, the last seed of C is 
then the seed with the highest postfix index in Q. 



3 Combinatorial properties of seeds and chains 

We first describe combinatorial properties of seeds and chains, that naturally 
lead to a recursive scheme to compute a maximum chain. Indeed, we show that 



given a chain C and its last seed P, the root and border of P define a partition 
of both Q — Qp and T — Tp into pairs of forests that contain the seeds C — {P} 
and form sub-chains of C. 

Definition 4. Let P be a seed on two trees Q and T and (a, b; c, d) be a quadru- 
ple such that 1{tq p ) < a < b < tq p , l{rT P ) < c < d < rx P and the pair of forests 
(Q[a, b],T[c, d]) is free of P's nodes (Q P n Q[a, b] = and T P n P[c, d] = 0). 

1. (a, 6; c, d) is called P -left-maximal if one of the following assertion is verified: 
(1) a = 1{tq p ) and c = 1{tt. ), (2) there exists (o;,y) € P such that x G Q a -i 
and y G T c _i. 

2. (a, 6; c, d) is called P -right-maximal if one of the following assertion is verified: 
(1) b = r-Q — 1 and d = rx — 1, (2) there exists (#, y) € P such that 6+ 1 G Qx 
and d + 1 G Tj, . 

For example, in Fig. [TJ let us consider the seed P — P 5 ; then, (4, 12; 2, 8) is 
P- left-maximal as (3, 1) G P, 3 is the root of Q3=4-i and its image 1 in T is the 
root of T\—2-\. Since (13, 9) is in P, Q13 contains 12 + 1 and Tg contains 8 + 1, 
assertion (2.2) of Definition [4] is also verified and (4, 12; 2, 8) is P-right-maximaJfl. 

Definition 5. Let (x, y) G B(P) for a seed P between Q and T. We define by 
F(x,y) — {(ai,bi;ci,di)} the set of all quadruples in [l(x); x[ 2 x[l(y); y[ 2 that are 
both P-left-maximal and P-right-maximal and such that there is no border node 
of P in Q (resp. T) on the path from b to x (resp. d to y). We call this set the 
chainable areas of the border node x. 

For example, let us consider a pair (x,y) in L(P) such that x and y are 
not a leaf of respectively Q and T, then F(x, y) represents the couple of forests 
Q x and %, F(x,y) = {(l(x),X - l;l(y),y - 1)}. In Fig. HJ with P = P A and 
(x, y) = (11, 7), F(x, y) = {(8, 10; 5, 6)}); if (x, y) - (14, 11) G P(P 5 ) - L(P 5 ), 
F(s, y) = {(0,1; 0,0), (4, 12; 2, 8)}. 

Definition 6. The chainable areas of a seed P, denoted by CM.(P), is the union 
of the sets of quadruples F(x, y) for all pairs (x, y) G B(P). 

Notation. For a seed P (resp. chain C) and a chainable area (a, 6; c, d), we say 
that P G (a,b;c,d) (resp. C C (a,b;c,d)) if, for every (x, y) G P (resp. (x, y) in 
a seed of C), x G Q[a, 6] and y G T[c, d]. 

The following property is a relatively straightforward consequence of the 
definitions of seeds and chainable areas. 

Property 1. Given a seed P between trees Q and T, |CA(P)| < 2 x |P(P)| + 1. 

The next property describes the structure of any chain between two forests 
Q[a,b] and T[c,d] included in a set of m seeds S = {P°, . . . , P™ 1-1 }. It is a 



5 Additional figures illustrating the definitions of this section are available in the Ap- 
pendices. 



direct consequence of the constraints that define a valid mapping and the fact 
that seeds are non-overlapping in a chain. 

From now, for every {x,y) of a seed P 3 , we denote by x J the unique node 
y of T associated with x in P° . We also denote by Fj (x) the set of quadruples 
F(x,x J ) for the pair of nodes (x, X 3 ) G P J . 

Property 2. Let P 3 be the last seed of a chain C included into two forests Q[a, b] 
and T[c, d}. 

1. C can be decomposed into |CA(P- 7 ')| + 2 (possibly empty) distinct sub-chains: 
P 3 itself, |CA(PJ)| chains: for each (e,f;g,h) 6 CA(P 3 ) a (possible empty) 
chain included into Q[e,f] and T[<?, h] and a chain included into the forests 
Q[a,l(rj)-l] andT[c,Z(rj)-l]. ' 

2. Moreover, C is a chain of maximum score if all of its sub-chains described 
above are maximal. 

Property [5J2 naturally leads to a recursive scheme to compute an optimal 
chain between two forests Q[a, b] and T[c, d] that ends by the last seed of a set. 
If MCP'(Q[a, b],T[c, d], {P° . . . P 3 }) is the score of a maximum chain between 
Q[a, b] and T[c, d] and that contains P J : 

MCP'(Qla,b],T[c,d],{P ...Pi})= (1) 

(0 if P J '(/_ {a,b;c,d), 

v(n + E MCP(Q[eJ],T[ g ,h], { P ...P^}) otherwise , 

(e,f-g,h)eCA(Pi) 

+MCP(Q[a, l(r 3 ) - 1], T[c, l(rj) - 1], {P° . . . P^ 1 }) 
and thus MCP(Q,T, S) can be computed using MCP' as follow^: 

MC"P(0[a,6],T[c,d],{P ...P J }) = max MCP'{Q[a, b],T[c, d], {P° . . . P 1 }) (2) 

i=Q...j 

MCP(Q,T,S) = MCP(Q[0,r Q },T[0,r T },S) (3) 



The main challenge in designing an algorithm for the MCP is then to imple- 
ment efficiently this recursive formula, that was already central in the dynamic 
programming algorithm of [7]. In Section 31 we will rely on the fact that for every 
seed P J , CA(P : >) and, for every border node x of P J , Fj(x), have been computed 
during a preprocessing phase. We discuss in Appendix the issues related to this 
preprocessing and we show that it can be done in (9(||5||) time and space. 

4 Algorithms for the Maximal Chaining Problem 

From now, we consider that we are given two ordered trees Q and T, a set 

5 = {P°, . . . , P m_1 } of seeds and a scoring score v on S. Furthermore, we 



We remind that the seeds are supposed to be sorted incrementally (see Remark [3]). 



assume that the score v(j) of a seed P 3 can be accessed in constant time and 
the seeds of S are given as a list I of triples (i, f,j) such that: (1) i is the postfix 
number of either the root of Pk or a border node of Pk (ie. i G B(Pq) U r(Pk)) 
and (2) / is a flag indicating if i is either border (/ = 0) or root (/ = 1) for Pk. 
Thus if i is both in B(Pk) and the root of Pq then i appears in two distinct 
triple^. Moreover, for a node i in Q belonging to a seed P J , we assume that the 
corresponding node in T, V (or more precisely its postfix number in T) can be 
accessed in constant time. Finally, for every node i in Q and T, its leftmost leaf 
l(i) is also supposed to be accessed in constant time. 

As a preprocessing, / is sorted in lexicographic order. Thus, if a node is both 
in the border and root of P 3 , it first appears in / as a border, then as a root. 
This sorting can be done in 0(||5|| log(||5||)) time. In our algorithms, we visit 
successively the elements of / in increasing order, and a seed P J is said to be 
processed after its root has been processed (i.e. the current element of/ is greater 
than (j-j, l,j) for the order defined above). 

In the following, we first introduce a simple but non optimal algorithm to 
compute the MCP between Q and T which does not require any special data 
structure. In a second step, we will present a more efficient method based on a 
simple modification of this algorithm. 

4.1 A simple non optimal algorithm 

In order to compute in constant time the partial MCP for any pair of forests in 
CA(P J ) as described in equation ([J), we introduce a data structure M indexed 
by quadruples of integers (a, b; c, d) defining the forests Q[a, b] and T[c, d). These 
quadruples (a, b; c, d) belong to a set Y = Y± U Y% U I3 defined as follows: 

m— 1 

Y 1 = |J CA(Pi), Y 2 = {(0,r Q ,0,r T )}, 



Y 3 = {(a,Z(rj-)-l;c,Z(r?)-l) s.t. 3(6, d) s.t. (a,b;c,d) G Y 1 UY 2 and F C (a,b;c,d)} 

In algorithmQ] the function Update allows to replace the value of M[a, b, c, d] 
by a real number w if w is greater than M[a, b, c, d]. We also use an array V of 
m integers to store the intermediate quantities of MCP . The correctness of the 
algorithm relies on the following invariants for the two data structures V and 
M, that we prove later: 

Ml. After P 3 has been processed, then M[a, b,c, d] = MCP(Q[a,b],T[c,d], 

{P°, . . . , Pi}) for every (a, 6; c, d) G Y. 
VI. After P 3 has been processed, then V[j] = MCP'(Q, T, {P°, . . . , P 3 }). 



7 Hence, we do not require as input the whole seeds mappings but just the borders 
and roots of the seeds, as it is usual when chaining seeds in sequences. 



Algorithm 1 MCP\\ compute the score of a maximal chain. 

1 for j from to m — 1 do V[j] — v(j) 

2 foreach (a, b; c,d) G Y do M[a, b, c, d] = 

3 foreach (i,f,j) in J do 

4 if / = then # i.e. (i,f) e B(P j ) 

5 foreach (a, &; c, d) G Fj (i) do V[j] = V[j] + M[a, b, c, d] 

6 else # i.e. / = 1 and i is the root of Q P j , i — Tj 

7 foreach (a, b; c, d) G Y\ U Yb S.t. P J C (a, 6; c, d) do 

8 Update M[o, 6, c, d] with w = V[j] + M[a, lijj) - 1, c, £(rj) - 1] 

9 foreach P 9 C (r, + 1, 6;rj + 1, d) do 

10 Update M[o, J(r 9 ) - 1, c, Z(rf ) - 1] with iw 

11 ^[j] = U[j] + M[0, l( rj ) - 1, 0, J(rJ) - 1] 

12 return maxj V[j] 

Correctness of the algorithm. Obviously, VI implies that maxj V[j] contains the 
score of the maximal chain (equations ([5J and ([3])). Let us assume now that Ml 
is satisfied. If the seed P J has been processed, then V[j) contains the sum of v(j) 
(line[T]), the MCP scores of the chainable areas of all its border nodes (line [5]) 
and the MCP score between forests Q(0, 1 fa) - 1) and T(0, l(rj) - 1) fline[TT|). 
From Property [5] and ©, V[j] = MCP'(Q, T, {P°, . . . , Pi}) and VI is satisfied. 
We prove Ml by induction. Initially, since no seed has been processed, line [2] 
ensures that Ml is satisfied. Now let us assume that Ml is satisfied for all 
processed seeds {P°, . . . , P^ 1 } and the input (i, l,j) is being processed. If P J <f_ 
(a, b; c, d), then by induction, Ml is satisfied for M[a, b, c, d]. Otherwise, the loop 
in lines [7] and [5] ensures that Ml is satisfied for all entries M[a, b, c, d] such that 
(a, b;c, d) £ Y\ U Y 2 , as (a,l(rj) — l;c, i(H) — 1) does not contain P- 7 ; thus by 
induction Ml is satisfied for this index. Finally, the loop in line [9] update all 
(a, b; c, d) £ Y 3 including P J ,and Ml is satisfied for all entries of M. 

Complexity analysis. From Property [TJ the space required to encode the entries 
of M indexed by Y\ is in 0(||S||). The space required to encode the entries of 
M indexed by I3 is in 0(m 2 ), as for every pair of seeds P % and P J , there is at 
most one chainable area of CA(P % ) that contains PK 

We now address the worst-case time complexity. We do not factor the prepro- 
cessing required to compute the Fj and CA and we assume / has been sorted in 
time 0(11511 log(||5||)). The amortized cost of lines @H5] is 0(||5||), as each chain- 
able area is considered once, there are Odl^l!) such areas, and we assumed we can 
access them in amortized constant time. A naive implementation of lines [oT- llll 
would require 0(m 2 \\S\\) operations: indeed, there are m iterations of the loop 
in line [6) the loop in line [7] considers only entries indexed by Y\ U Y2 (there are 
0(||5||) such entries) and the loop on line Illiterates 0(m) times. However, we can 
notice that there are 0(m) entries (a, 6; c, d) £ Y\ U Y2 such that P J C (a, b; c, d), 
and it is possible to preprocess I in time and space 0(771115*11) in such a way that 
the loop in line [7] can be implemented to perform 0(m) iterations, leading to 
a total time complexity of 0(m||5|| + m 3 + \\S\\ log(||5||)) (respectively for the 
preprocessing, the main algorithm and sorting the input). 



4.2 A more efficient algorithm 

The key ideas are to access less entries from M (while maintaining property 
Ml on the remaining entries though) and to complement M with a data struc- 
ture R that can be queried in 0(log(m)) instead of O(l), but whose main- 
tenance does not require a loop with 0(m 2 ) iterations. Formally, let X = 
{(a, c) s.t. 3(a, b; c, d) £ Y\ U Y^} and R be a data structure indexed by X such 
that for a given index (a, c) G X , R[a, c] is a set of pairs (j, s) where j is the index 
of the seed P J and s is the score of the chain in Q[a, rj],T[c, r 1 -] that ends with 
Pi . Roughly, M is used to access, still in 0(1) time, the values MCP(a, l(rj) — 
l,c, Z(rj) — 1, {P° . . .P^ 1 }) required to compute MCP' in equation ^ and 
R[a,c] is used to access, in time 0(log(m)), the scores of the best chains in- 
cluded in (Q[a, r Q ],T[c, r T \) (the values MCP(Q[e, f],T[g, h],{P°... P^ 1 }) in 
equation (JXJ) ) and replace the entries M[a, b, c, d] with (a, b; c, d) G Y\ U Y% that 
were used in the previous algorithm. 

Finally, the algorithm iterates on a list of triples J = I[j {y>T=o (^('"j), —1, j)) , 
sorted using the lexicographic order than in the previous section, with the fol- 
lowing modification: if we have two seeds P- 7 and P 9 with g > j such that 
(l(rj),l(rj)) — {l(r g ),l(r 9 )) then only (l(rj), — 1,j) occurs in J. This preprocess- 
ing requires 0(||5|| log(||5||)) time. 



Algorithm 2 MCP2{Q, T, S, v): compute a maximal chaining from S. 

1 for j from to m — 1 do V[j] — v(j) 

2 foreach (a, b; c, d) G Y3 do M[a, b, c,d] — 

3 foreach (a, c) G X do R[a, c] — 

4 foreach (i,f,j) in J do 

5 if / = -1 then #i = l(rj) 

6 foreach (a, c) G X s.t. a, c < £(rj), l(r 3 ) do 

7 M[a,Z(rj) — l,c,l(r 3 -) — 1]= value s of the last (y,s) of _R[a,c] s.t. r^ < l(r 3 -) 

8 else if / = then # (i,f) G B{P 3 ) 

9 foreach (o, b; c, d) G Fj (i) do 

10 Add to V[j] the value s of the last entry (y, s) of R[a, c] s.t. Vy<d 

11 else # / = 1 and i is i/ie rooi 0/ Qpj , i — r j 

12 foreach (a,c) G X s.t. a,c< l(rj),l(r 3 ) do 

13 w = V[j]+M[a,l( rj ) ~ l,c,l(rj) - 1] 

14 Insert entry (j, w) into i?[a, c] and update R[a, c] as follow: 

15 Find the last entry (y, s) s.t. r" < H 

16 if s < to then 

17 Insert (j, w) just after (y, s) in _R[a, c] 

18 Remove from R[a, c] all entries (z, t) s.t. rj < rf and t < w 

19 V[j] = Vtf] + M[0, J(rj) - 1, 0, l{r]) - 1] 

20 return maxj V[j] 



Correctness of the algorithm. We consider the following invariants. 
M2. After P j has been processed, then M[a,b,c,d] = MCP(Q[a,b],T[c,d], 

{P° , . . . , P J '}) for every (a, b; c,d) GY 3 . 
VI. After Pi has been processed, then V[j] = MCP'(Q,T, {P°, . . . ,P J }). 
Rl. After P J has been processed, then for all (a, c) € X, R[a, c] contains all (y, s) 

that satisfies 

a. y < j and a = MCP'(Q[a,r v ],T[c,ry], {P°, . ..,P y }). 

b. V(z, i) e P[a, c], r% <r v y ^t < s. 

R2. V(a, c) e A, P[a, c] is totally ordered as follows: (y, s) < (z, t) iff r y y <r z z . 

We first assume that Rl and R2 are satisfied. As previously, if VI is satisfied, 
then the algorithm computes MCP(Q,T,S). The initialization line [T] ensures 
that V[j] contains v(j). Next to prove VI we only need to show that when we 
process a border i of a seed P- 7 , in line [10] we add to V[j] the best chains of each 
chainable area (a,b;c,d) of the border; it follows from (1) the fact that every 
seed pi +e with e > does not belong to the forest Q[a, b] (because b < i < rj +e ) 
and thus can not belong to a chain in the (a, b; c, d) area, (2) the fact that the 
score of this chain is present in R[a,c] (from Rl) and (3) that fact it is the last 
entry (y, s) such that r^ < d (from R2). 

M2 is similar to Ml but restricted to entries M[a, b, c, d] such that (a, 6; c, d) € 
F3. To check it is satisfied, we only need to focus on line [3 as it is the only line 
that updates M. For entries M[a,b,c,d] such that a > l(rj) or c > /(rj), then 
M[a, b, c, d] = due to the initialisation in line[TJ For all other entries, M2 follows 
immediately from Rl and R2, using argument similar to the previous ones. 

Finally, we need to check that Rl and R2 are satisfied. First, as previously, 
in the case where a > l(rj) or c > ^(H), R[a,c] = which is ensured by the ini- 
tialisation in line[3] So we need only to consider the case where a,c < l(rj), ?(H), 
that is handled in lines [TT] to [TH Every seed P v such that y < j has already been 
processed and s — MCP'(Q[a, r y ],T[c, rt], {P°, . . . ,P y }) can not be modified 
after P v has been processed, so lines [T2l and IT3l together with M2, ensure that 
(y,s) has been inserted into R[a, c] previously, and the same argument applies if 
y = j. Entries (z, t) removed at linelTSldo not belong to any of these (y, s), which 
implies that Rl.a and Rl.b, and so Rl, are satisfied. R2 is obviously satisfied 
from the position where (j,w) is inserted into R[a,c] in line 1171 

Complexity analysis. The space complexity is given by the space required for 
structures M and R. M requires a space in 0(m 2 ) as it is indexed by Y 3 . R 
requires a space in 0(m||5||), as \Yi U Y 2 \ G 0(||5||) and for each seed P- 7 , 
an entry (j,s) is inserted at most once in each R[a,c\. All together, the space 
complexity is then 0(m 2 + m||5||) = 0(m||5||). 

We now describe the time complexity. First, note that following the technique 
used for computing maximal chains in sequence [619112] . the structures R[Iq, It] 
can be implemented using classical data structures such as AVL or concatenable 
queues supporting query requests, insertions, successor and, predecessor and 
deletions in a set of n totally ordered elements in 0(log(ra)) worst-case time. 



Now, we analyse the complexity of lines [5] to [7J The loop of line|5]is performed 
at most 0(m||5||) times and each iteration requires 0(log(m)) in time (line [7]), 
which gives an amortized time complexity of 0(771115*11 log(m)). 

Line[TU]is applied at most once for each of the 0(||S||) chainable area Fj(i) 
(Property!]]), and each iteration requires 0(log(m)), which gives an 0(11511 log(m)) 
amortized time complexity. 

Finally, we analyse the complexity of lines [TT] to Qj|] First, we do not consider 
the operation in line 1181 The number loop starting in line [T2] is performed in 
0(m), and the complexity of each loop is in 0(||5||). The cost of the opera- 
tions performed during each iteration is 0(log(||5||)) (lines [T3l and IT6l are both 
performed in 0(1) and lines [HI and 1 1 5 1 in time 0(log(||5||)). The total time com- 
plexity of this part, without considering line IT%1 is then 0(m||5|| log(||5||)). To 
complete the time complexity analysis, we show that the amortized complexity 
of line [TH] is in 0(m||5||). Indeed, it follows from R2 that all entries removed in 
one step are consecutive in the total order on R[a, c] defined in R2. Hence, if one 
call to line [TBI removes k elements from R[a, c], it can be done in 0((fc+2) log(m)) 
time, as the successor of a given element can be retrieved in 0(log(m)) time. 
As every element of R is removed at most once during the whole algorithm, this 
leads to an amortized complexity of 0(?77||5|| log(m)) for line[THJ Alltogether, our 
algorithm solves computes MCP(Q, T) in time 0(m||5|| log(m)), using standard 
data structures and after a preprocessing in time 0(||5|| log(||5||)) to compute 
the chainable areas and to sort J. 

Additional remarks. If we consider that Q and T are sequences, or, as described 
in Section[2] unary trees, then each of the two trees has a single leaf and each seed 
is unambiguously defined by its root and border, which implies that||5|| = m. 
There is only one R[a,c), as a = c = 0, that contains 0(m) entries. Hence, all 
loops that were iterating on R have now a single iteration, which reduces the 
time complexity by a factor m to 0(||5|| log(m)) = 0(mlog(m)). 

In the complexity analysis above, we followed the approach used for express- 
ing the complexity of chaining in sequences, as we expressed the complexity 
only in terms of the size of the seeds. To express the complexity of our al- 
gorithm in terms of the size of Q and T, a finer analysis of the data struc- 
ture R and of the number of different chainable areas leads to the following 
result: the worst-case space complexity of our algorithm is 0(\Q\ 2 \T\ 2 ) (simi- 
lar to the algorithm of Heyne et al.), and its worst-case time complexity is in 
0(||5||log(||5||) + |Q||T|log(|T|)(|Q||T| +m)), to compare with the complexity 
of the Heyne et al algorithm that is in 0(||5|| log(||5||) + \Q\ 2 \T\ 2 {\Q\\T\ + m)) 
(see details in the Appendix). This alternative complexity analysis is mostly 
of theoretical interest as in practice, for RNA analysis, one can expect that 
m<\Q\\T\. 

5 Conclusion 

The current paper describes algorithms to solve chaining problems in ordered 
trees. With respect to similar problems in sequences, these methods exhibit a 



linear factor increase both in time and space. Chains so obtained can be used to 
speed-up RNA structure comparisons, as illustrated in [7I11J . 

A natural question related to chaining problems, that, as far as we know, 
has not been considered in the case of sequences, is to decide whether a given 
seed P of a set of seeds S belongs to any optimal chains or not. However a 
trade-off between quality and speed may have to be find. Indeed, identifying 
these always optimal seeds would ensure a good quality of the chains, whereas 
the high complexity of these identifications would slow down the detection of 
similar structures in a large database. 
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Appendix: remarks on the dynamic programming 
algorithm of Heyne et al. 

This appendix proposes a comparison of the worst-case time complexity between 
chaining algorithms proposed in the current paper and in [7]. 

Time complexity of Heyne et al. algorithm [7J 

Heyne et al. algorithm [7 considers pairs (Si, S2) of arc-annotated sequences, of 
respective length ri\ and rii- 

This algorithm is based on processing holes in subsequences corresponding 
to seeds, where a hole in an arc-annotated sequence is a part of the subsequence 
spanned by a seed that does not belong to the seed. In trees, holes correspond 
to the chainable areas of the border nodes of a seed. The complexity of this 
algorithm is in 0{hn\ni) where h is the number of different holes, n\ and n 2 
corresponds respectively to the hole size in S\ and S2 ■ Due to the constraints on 
the definition of seeds proposed in [7] (connected nucleotides in RNA structure), 
authors claim a time complexity of 0(n 2 n 2 ) as h is bounded by 0(ni x n 2 ) [7]. 

Actually, this complexity do not take into account the time required to es- 
tablish the holes and to sort them (holes are treated in a specific order). Thus, 
the total time complexity of this algorithm is 0(||S|| log \\S\\ +h X ni X n-z) where 
0(||S||) is the sum of seed sizes. 

A worst-case complexity analysis In the following, we propose an analysis of the 
complexity of Heyne et al. algorithm 7 in the case of more general seeds. 

This algorithm uses dynamic programming tables indexed by holes (in fact 
pairs of holes, one in each sequence defined by a seed). Given a hole h induced 
by a seed and defined by the sequence Si[h L1 ,h R1 ] in Si and the sequence 
S 2 [h L2 7 h R2 } in S 2 , D h {j 1 l) is the best chain included in Si[h L1 ,j] in Si and 
S2[h L2 , 1] in S2. Each hole is then processed independently from the other ones 
(in an order ensuring required information have already been computed), in 
order to fill the table D , using the following dynamic programming equation: 

D h \j,l]=m>ax{D h (j-l,l),D h (j,l-l), max {D h (p - l,q - 1) + S P }} 

seed PQh 
st.P ends on j,l 

S P =v(P)+ Y, D h (h R \h R2 ) 

fee Holes of P 

where, p (resp. q) is the first base of seed P in Si (resp. S2) and Sp is the score 
of the best chain included in the subsequences spanned by P in Si and S2 and 
ended by P. 

We can easily transpose this recurrence on trees, using article notation, as 
follow: 

D h (j-l,l), 

D h=(.a,b,c,d)r i ,1 = J D h {j,l- 1), 

■'" J ^ -^ sccdPciwl) {D h (l(rp)-lJ(r R )-l) + S P } 

st.(rp,r£)=(;M) 



S P = v{P)+ ^ D^ b ' c ^[b,d] = MCP / (Q rp ,T r P,P°- p ) 

(a,b,c,d)eCA(P) 

First, let us remark that the computation of one dynamic programming ma- 
trix can be done in 0{n\n 2 + m) as the matrix has at most n\ x n 2 entries and 
the search of the seeds P which ends on j, I requires a pre-processing in 0(m). 

Thus, assuming that holes have already been computed, the total time com- 
plexity is 0(115*11 log(||5||) + h x (nin 2 + m) + \\S\\) (ie. complexity of sorting of 
the holes plus the computation cost of D h plus the computation cost of Sp). 

In [7], authors design seeds that are connected nucleotides in the RNA 
secondary structure either by backbone bond or base-pair bond. Hence, h is 
bounded by n\n 2 and the worst-case time complexity is 0(||5|| log(||5||) + nin 2 x 
( ni n 2 +m) + \\S\\). 

If we impose seed nodes to be connected in the trees (and not in RNA), which 
is a special case of our seeds but different from the seeds developed by Heyne et 
al., the number of different holes would be 0(n\n\) in the worst case (all possible 
quadruplets (a, b, c, d)). The overall complexity of the dynamic programming 
algorithm then becomes in the worst case: 

0(||S||log(||5||) + n?n§(nin 2 + m) + ||5||). 



Time complexity of Algorithm 2 

To establish the worst-case complexity of Algorithm |2J we have to study the cost 
of the algorithm for each / values. To ease the reading, we denote by m the size 
of Q and n 2 the size of T. Without loss of generality, we furthemore assume that 

n2<m. 

Following invariants Rl and R2, each list of R contains at most min(m,n 2 ) 
elements, as there are no (y , s) , (y' , s') e R[a,c] s.t. r^ = r v ,, and \X\ < 
min(\\S\\,mri2). Thus, in the worst-case, we have at most 0{n\n\) different 
chainable areas, \R\ — Ofoinz), for all (a,c): \R[a,c]\ = 0(712) and \X\ = 
0{nin 2 ). 



/ = — 1 line[5] Over the whole execution of the algorithm each M [a, I (rj ) — 1 , c, I M ) — 
1] is computed only once for all possible quadruplets as there is no (i,/, j), 
(i'tf'if) G J sucn that (l(rj),l(rj)) = (l(rj>), l(r 3 j, )). Each computation 
require a search in R[a,c] that can be done in 0{log{n2))- Thus, the total 
time complexity for this case is 0(n\n 2 K log(n2)) ■ 

/ = line HJ The computation line [10] can be store in a dedicated array M' such 
that the best chain of the area (a, b, c, d) is computed only once. Thus, over 
all the execution of the algorithm, each different chainable area requires a 
search into a R[a, c) and the total time complexity for this case is OdlSH + 
nfn|log(n 2 )). 

/ = 1 line 111! This case is run once peer seeds, so 0(m) times. Each run cost 
0(nin2log(n2)) and the total time complexity is 0{mnxn- 2 \og{n^)). 



From above, we conclude that the worst-case time complexity of our algo- 
rithm is 

0(\\S\\ log(||5||) + n\nllog{n 2 ) + \\S\\ + n\n\ log(na) + mn x n 2 log(n 2 )) 
= 0(||5|| log(||5||) + mnj log(n 2 )(nm 2 + m) + ||5||) 
= 0(11511 Iog(||5||) + mm log(n 2 )(mn 2 + m)) 

which represents an improvement of the worst-case complexity of Heyne et al. 

algorithm [TJ. 

To conclude, we can merge the worst-case complexity analysis with the time 

complexity analysis of section 14.21 leading to the following time complexity for 

Algorithm [2j 

0( \\S\\ computing the chainable areas 

+ 1 1 S 1 1 log( 1 1 S 1 1 ) sorting the areas 

+ min(m, run?) X mm(||S'||,nin2) X log{min(m,n2)) f = —1 case 

+ ||5|| +imn(||5||,nin2) x log(min(m,n 2 )) / = case 

+m x min(|jS , ||,nin 2 )log(min(TO,n 2 )) / = 1 case 

as \X\ < min(||S'||, nin 2 ) and \R[a, c]\ < min(?rj,n 2 ) for all a,c. 



Appendix: additional figures 
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Fig. 2. Illustration of Definition [5] for a seed P (the shaded zone) and (x,y) € 
B(P) - L(P): F{x,y) = {(ao,&o,co,ci), (oi,6i,ci,ci), (a 2 ,&2,C2,c 2 )} 





Fig. 3. Illustration of the notion of chainable areas of a seed of size 5: P — 
{(xo, j/o), ■ • ■ j (iC4j 2/4)} and CA(P) contains 4 chainable areas each indicated by 
a different filling pattern. 



Appendix: Computing F(x, y) and families of seeds 

The cost of the computation of the chainable areas for the border nodes of a 
seed depends of the nature of this seed. We describe here an efficient algorithm 
that compute the F(x,y). 

Let P be a seed between two trees Q and T and let B(P) the set of pairs 
of its border nodes. For each node i of Q and T (in fact, only required for the 
border nodes of P) , we suppose that we have access to the following informations 
in 0(1): 

— l(i); the leftmost leaf of i, 

— u(i): the node with the highest index such that r(u(i)) — r(i) where r(i) is 
the right most leaf of i. 

The nodes u(i) are often referred as rightmost roots of the tree. 

In Algorithm |3l we use B instead of B{P) and we assume that B is an array 
of k pairs of nodes on Q x T, For < i < k, B[i] represents the (i + l) th pair of 
B and B[i\q is the node Q of this pair and B[i\t is the node of T of this pair. 

Algorithm [3] makes use of a stack of pair of nodes called Stack. top(Stack) 
refers to the last element inserted into Stack and similarly to B, top{Stack)Q 
and top{Stack)T are the node of Q and node of T of top(Stack). We write 
push(Stack, (x,y)) to add (x, y) to the top of the stack and pop(Stack) remove 
the last element of the stack. 

The algorithm that computes F(x, y) for all pair of border nodes (x, y) of P 
is presented in Algorithm [3J 

Description of the algorithm. In the following, a pair of border node of P is 
called shortly a pair. Pairs are traversed incrementally according to their postfix 
index. Hence, descendants are visited before parents. Remind that a seed is 
a valid mapping so ancestral and order relations between borders nodes are 
respected. Thus, if a border node is a leaf in Pq, it is also a leaf in Pr- 

Before each insertion of a new chainable area into F(x, y) we test whether 
the area is non-empty or not (cf. lines IS] fTTl [T7l HUof Algorithm [3]) . 

Let us call the direct descendants of a pair (x,y), the pairs (x',y') £ B(P) 
such that x' is a descendant of x (resp. y' is a descendant of y) and there is no 
border node of P in Q (resp. T) between x' and x (resp. y' and y). 

Except for the last pair, each time a pair is visited, it is added to the Stack 
as it is necessarily the direct descendant of a none visited pair. Note that Stack 
contains pairs sorted incrementally by their postfix index. 

Two cases must to be considered: (1) a pair is a pair of leafs in Qp, Tp 
((x,y) £ L(P)) and (2) a pair is not a pair of leafs in Qp,Tp ((x,y) £■ L{P)). 
Lines [5]-[8] correspond to the first case and do not require additional explanations. 

For the second case, the current pair (x, y) necessarily has direct descendants 
in B(P). Those descendants have been visited (lower postfix index) and thus 
are in Stack. The chainable area (a, b, c, d) (possible empty) on the right of its 
rightmost direct descendant {x',y') (a > x' and c > y') and the chainable area 
(possible empty) (a,b,c,d) on the left of its leftmost direct descendant (x",y") 



(b < x" and d < y") require a particular treatment. The possible chainable 
areas between two direct descendants are considered by the loop on lines IT51 - 
rn?l To compute these chainable areas, the following properties are used in the 
algorithm: let (x,y) G B(P) 

1. Any chainable area (a, b, c, d) of (x, y) are such that b and d are children of 
x and y and a and c are the leftmost leafs of children of x and y. 

2. By definition, for chainable areas (a, b, c, d) of (x, y) except the one on the 
right of its rightmost descendant, b and d are such that b + 1 and d + 1 are 
the leftmost leafs of a direct descendant of x and y. 

3. For chainable areas (a, b, c, d) of (x, y) except the one on the left of its leftmost 
descendant, a and c are such that a — 1 and c — 1 are children of x and y 
and are either border nodes or ancestor of border nodes. 

As Stack is sorted incrementally, the top of the stack contains the rightmost 
direct descendant of current pair. Lines ITTHT21 compute the chainable area on 
the right of this descendant. Then, loop in lines I15H15I compute the area between 
the direct descendants using the above properties. Finally, lines I2T1- |2"21 compute 
the chainable area one the left the leftmost direct descendant (that is the last 
pair (x', y') in the Stack such that x' > l(x) and y' > l(y)). Finally, remark that 
all direct descendant are now out of the Stack and are replaced by the current 
pair. 

The time complexity of this algorithm is 0(\B(P)\) as we iterate of all pair 
and each pair is added only once to the Stack. 

Note that our algorithm is general as it applies to any sets of seeds as defined 
in Definition [2j When considering restricted families of seeds, it is possible to 
design simpler, while still efficient, algorithms to compute the F{x, y) and the 
chainable areas. For example if one considers only compact seeds, i.e. seeds such 
that B(P) = L{P) for every seed P, then for each border (x,y), \F(x,y)\ = 1 
and the computation requires a time linear in the number of seeds. A discussion 
on the issue of computing chainable areas depending of the combinatorial nature 
of the considered seeds will be added in a journal version of the current work. 



Algorithm 3 F(x,y): compute the F(x,y) for a seed P 

1 sort B incrementally 

2 foreach pair (x, y) of B do 

3 F(x,y) = ® 

4 for i from to k — 1 do 

5 if i = or B[i — 1]q < /(_B[i]q) then #B[i] is a pair of leafs in Qp and Tp 

6 if 1(B[i\q) < B[i]g — 1 #B[i]q is not a leaf in Q 

7 and l(B[i] T ) < B[i] T - 1 then #B[i] T is not a leaf in T 

8 F(B[i\) = (l{B[i\ Q ),B[{\ Q - \,l(B\i\ T ),B[i] T - 1) 

9 else #B[i]Q has at least one descendant in P that is the top of Stack 

10 # the rightest chainable area of B[i]: 

11 if u{top{Stack) Q ) + 1 < B[i] Q - 1 and u(top(Stack) T ) + 1 < B[i] T - 1 

then 

12 ^ (BW) = (u(top{Stack) Q ) + 1, B[i] Q - 1, u(top{Stack) T ) + 1, S[i] T - 1) 

13 # i/ie middle chainable areas of B[i]: 

14 (x,y) = top(Stack); pop(Stack) 

15 while Stack not empty and top(Stack)c> > 1(B[i\q) do 

16 #insert the area between (x,y) and top(Stack)Q if not empty 

17 if u(top(Stack)Q) + 1 < J(cc) - 1 and u(top(Stack) T ) + 1 < J(y) - 1 

then 

18 F(S[i])+ = (u(top(Stack) Q ) + 1, Z(x) - 1, u(top{Stack) T ) + 1, i(y) - 1) 

19 (x,y) — top(Stack); pop(Stack) 

20 # i/ie leftest chainable area of B[i] 

21 if /(x) - 1 > i(B[i]o) and /(y) - 1 > /(B^t) then 

22 iW])+ = (I(B[i] Q ), {(*) - 1, J(B[i]T), I(y) - 1) 

23 if i < fc — 1 then 

24 push{Stack,()) 
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